{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "provenance": [],
      "gpuType": "T4"
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    },
    "language_info": {
      "name": "python"
    },
    "accelerator": "GPU"
  },
  "cells": [
    {
      "cell_type": "markdown",
      "source": [
        "## Install Required Packages\n",
        "First, install the required packages: LangChain, LanceDB, the OpenAI Python client, a PDF parser, `rank_bm25` for keyword search, and `tiktoken` for tokenization."
      ],
      "metadata": {
        "id": "T0zgnwxTQ137"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# Uncomment and run these lines once per runtime to install the dependencies.\n",
        "# !pip install langchain langchain-community lancedb openai\n",
        "# !pip install requests pypdf PyPDF2\n",
        "# !pip install rank_bm25 tiktoken"
      ],
      "metadata": {
        "id": "0_VZikdncSua"
      },
      "execution_count": 24,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Set Up Your API Key\n",
        "In Google Colab, you can set your API key by assigning it directly in the notebook or by reading it from an environment variable. For security, avoid hardcoding a real key in a notebook you share; prefer an environment variable such as `OPENAI_API_KEY`.\n"
      ],
      "metadata": {
        "id": "ZTrniR7xQ7AS"
      }
    },
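    {
      "cell_type": "code",
      "source": [
        "import os\n",
        "\n",
        "# Read the key from the OPENAI_API_KEY environment variable if it is set;\n",
        "# otherwise fall back to the placeholder, which you must replace with your own key.\n",
        "openai_api_key = os.environ.get('OPENAI_API_KEY', 'your_key')"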
      ],
      "metadata": {
        "id": "xH1MyrT1QgxK"
      },
      "execution_count": 7,
      "outputs": []
    },
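    {
      "cell_type": "markdown",
      "source": [
        "## Import Libraries and Initialize Embeddings\n",
        "Import the retriever, vector store, and document loader classes, then create the OpenAI embedding model with the API key set above."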
      ],
      "metadata": {
        "id": "YXyRES0dRJBM"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "from langchain.vectorstores import LanceDB\n",
        "import lancedb\n",
        "from langchain.retrievers import BM25Retriever, EnsembleRetriever\n",
        "from langchain.schema import Document\n",
        "from langchain.embeddings.openai import OpenAIEmbeddings\n",
        "from langchain.document_loaders import PyPDFLoader\n",
        "\n",
        "\n",
        "# Initialize embeddings for semantic search\n",
        "embedding = OpenAIEmbeddings(openai_api_key=openai_api_key)\n"
      ],
      "metadata": {
        "id": "8sZmXcm4fQYa"
      },
      "execution_count": 8,
      "outputs": []
    },
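    {
      "cell_type": "markdown",
      "source": [
        "## Download the PDF\n",
        "Before we start, download the PDF used in this example. The helper below retries a failed download up to three times, waiting five seconds between attempts."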
      ],
      "metadata": {
        "id": "buwyoMhlSgRs"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "import requests\n",
        "import time\n",
        "\n",
        "def download_pdf(url, save_path, retries=3):\n",
        "    attempt = 0\n",
        "    while attempt < retries:\n",
        "        try:\n",
        "            response = requests.get(url, stream=True, timeout=30)  # Time out instead of hanging on a stalled connection\n",
        "            response.raise_for_status()  # Check if the download was successful\n",
        "            with open(save_path, 'wb') as file:\n",
        "                for chunk in response.iter_content(chunk_size=8192):\n",
        "                    file.write(chunk)\n",
        "            print(f\"Downloaded PDF from {url} to {save_path}\")\n",
        "            return True\n",
        "        except requests.exceptions.RequestException as e:\n",
        "            attempt += 1\n",
        "            print(f\"Error downloading PDF (attempt {attempt} of {retries}): {e}\")\n",
        "            if attempt < retries:\n",
        "                time.sleep(5)  # Wait before retrying\n",
        "    return False\n",
        "\n",
        "# Example URL and file path\n",
        "pdf_url = \"https://pdf.usaid.gov/pdf_docs/PA00TBCT.pdf\"\n",
        "pdf_path = \"/content/Food_and_Nutrition.pdf\"\n",
        "\n",
        "# Download the PDF\n",
        "if not download_pdf(pdf_url, pdf_path):\n",
        "    raise Exception(\"Failed to download PDF after multiple attempts\")\n"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "j3Fwp59eQb4U",
        "outputId": "13fad0d5-8302-44e4-dd27-73c19a053a59"
      },
      "execution_count": 9,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Downloaded PDF from https://pdf.usaid.gov/pdf_docs/PA00TBCT.pdf to /content/Food_and_Nutrition.pdf\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Load and Split the PDF\n",
        "Use `PyPDFLoader` to load the PDF and split it into `Document` objects that the retrievers can index."
      ],
      "metadata": {
        "id": "sKqgpHNKSzIB"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "from langchain.document_loaders import PyPDFLoader\n",
        "\n",
        "# Load the PDF from the path it was downloaded to and split it into documents\n",
        "loader = PyPDFLoader(pdf_path)\n",
        "pages = loader.load_and_split()\n"
      ],
      "metadata": {
        "id": "AAGetOkJSx2Y"
      },
      "execution_count": 10,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Initialize the BM25 Retriever\n",
        "Set up the BM25 (keyword-based) retriever to return the top 2 matches for each query."
      ],
      "metadata": {
        "id": "Lx-OBvmEbFEH"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "from langchain.retrievers import BM25Retriever\n",
        "\n",
        "# Initialize the BM25 retriever\n",
        "bm25_retriever = BM25Retriever.from_documents(pages)\n",
        "bm25_retriever.k = 2  # Retrieve top 2 results using BM25\n"
      ],
      "metadata": {
        "id": "ALbSOdbvS32V"
      },
      "execution_count": 11,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Create LanceDB Vector Store for Semantic Search\n",
        "Connect to a local LanceDB database and create a table seeded with a single placeholder row, which defines the table schema (`vector`, `text`, `id`)."
      ],
      "metadata": {
        "id": "yJsw2kBTbVys"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "import lancedb\n",
        "\n",
        "# Create lancedb vector store for semantic search\n",
        "db = lancedb.connect('lancedb')\n",
        "table = db.create_table(\"pandas_docs\", data=[\n",
        "    {\"vector\": embedding.embed_query(\"Hello World\"), \"text\": \"Hello World\", \"id\": \"1\"}\n",
        "], mode=\"overwrite\")\n"
      ],
      "metadata": {
        "id": "SaaWRWr7bL7U"
      },
      "execution_count": 14,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Initialize LanceDB Retriever\n",
        "Embed the loaded documents into LanceDB and expose the vector store as a retriever that returns the top 2 semantic matches."
      ],
      "metadata": {
        "id": "CAfUn3Qhc6Az"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "from langchain.vectorstores import LanceDB\n",
        "import lancedb\n",
        "\n",
        "# Connect to the local LanceDB database directory used above\n",
        "connection = lancedb.connect('lancedb')\n",
        "\n",
        "# `pages` is the list of Document objects loaded earlier;\n",
        "# embed them and store the vectors in LanceDB\n",
        "docsearch = LanceDB.from_documents(pages, embedding, connection=connection)\n",
        "\n",
        "# Expose the vector store as a retriever that returns the top 2 matches\n",
        "retriever_lancedb = docsearch.as_retriever(search_kwargs={\"k\": 2})\n"
      ],
      "metadata": {
        "id": "--QcN4X-bsex"
      },
      "execution_count": 20,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Initialize the Ensemble Retriever\n",
        "Combine the BM25 and LanceDB retrievers, weighting keyword results at 0.4 and semantic results at 0.6."
      ],
      "metadata": {
        "id": "O_fOvcarc-XZ"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "from langchain.retrievers import EnsembleRetriever\n",
        "\n",
        "# Initialize the ensemble retriever with weights\n",
        "ensemble_retriever = EnsembleRetriever(retrievers=[bm25_retriever, retriever_lancedb], weights=[0.4, 0.6])\n"
      ],
      "metadata": {
        "id": "pRM8w--Zc-4Q"
      },
      "execution_count": 21,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Retrieve Relevant Documents\n",
        "Perform a query and retrieve relevant documents using the ensemble retriever."
      ],
      "metadata": {
        "id": "qxRL7lIZdGug"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# Placeholder query; replace it with a question about the handbook\n",
        "query = \"Lorem ipsum dolor sit amet\"\n",
        "\n",
        "# Retrieve relevant documents\n",
        "docs = ensemble_retriever.get_relevant_documents(query)\n",
        "\n",
        "# Print retrieved documents\n",
        "for doc in docs:\n",
        "    print(doc.page_content)"
      ],
      "metadata": {
        "id": "Lrw-RWdIdHFQ",
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "outputId": "b7cd54b6-3bf9-4419-fb5a-0ba0c26dda43"
      },
      "execution_count": 23,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "MINISTRY OF AGRICULTURE,\n",
            "ANIMAL INDUSTRY AND FISHERIES\n",
            "P.O. Box 102 ENTEBBE - UGANDA\n",
            "www.agriculture.go.ug\n",
            "Food and Nutrition Handbook for Extension Workers72\n",
            "Picture 18: A back yard garden and small animals and local chickens\n",
            "Food and Nutrition Handbook for Extension Workers25• Shortage of iodine decreases IQ and causes a productivity loss.\n",
            "• Farmers with low literacy levels are less likely to adopt improved \n",
            "agricultural practices hence leading to poor agricultural production \n",
            "and productivity.\n",
            "• People with low literacy levels are bound to have poor health seek -\n",
            "ing behaviours and access to quality health services.\n",
            "• Mothers with low education level are likely to follow poor feeding \n",
            "practices hence affecting the nutritional and health status of family \n",
            "members.\n",
            "• Contributes to poverty.\n",
            "• Cost of treating illnesses attributable to malnutrition.\n",
            "• Cost of caring for sick.\n",
            "• Lost care for other (not sick) household members.\n",
            "b)\tConsequences \tof\tovernutrition\n",
            "Malnutrition can lead to multiple medical conditions including:\n",
            "• Coronary heart disease (heart attack) \n",
            "• Diabetes (high blood sugar)\n",
            "• Gout (swollen painful joints)\n",
            "• Hypertension (high blood pressure) \n",
            "• Overweight\n",
            "• Obesity \n",
            "• Death\n",
            "Malnutrition increases the risk of death and illnesses\n",
            "Malnutrition weakens immunity and predisposes individuals to different \n",
            "infections.\n",
            "• More than half of infant deaths are associated with malnutrition.\n",
            "• Marasmus and kwashiorkor and finally death are caused by severe \n",
            "malnutrition.\n",
            "• Goitre due to iodine deficiency.\n",
            "• Night blindness to complete blindness from vitamin A deficiency.\n",
            "• Anaemia from iron deficiency.\n"
          ]
        }
      ]
    }
  ]
}